Getting hold of HTS data

  • From public repositories
  • From collaborators
  • By sequencing some of your own material!

Public Repositories for HTS

  • Several public sources of HTS data exist.
  • We concentrate first on those that act as repositories:
    • GEO (Gene Expression Omnibus)
    • ENA (European Nucleotide Archive)
    • SRA (Sequence Read Archive, formerly the Short Read Archive)

Gene Expression Omnibus

.pull-left[ igv]
.pull-right[
- GEO holds different types of biological datasets.
- Very popular for submission of data accompanying publication.
- Captures metadata, processed files and raw data.
- GEO was not built for HTS data.
]
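
Because GEO keeps metadata, processed files and raw data under a single series accession, it can help to see where those files sit. The following is a minimal sketch, assuming GEO's usual FTP layout in which series are grouped into directories by replacing the last three digits of the accession with `nnn`. The helper name `geo_series_dir` is hypothetical and not part of any GEO tool.

```python
import re

# Assumed root of the GEO series area on the NCBI FTP/HTTPS server.
GEO_FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_series_dir(accession: str) -> str:
    """Return the assumed directory URL for a GEO series accession (GSE...)."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # Series are grouped in blocks of 1000: GSE12345 -> GSE12nnn, GSE123 -> GSEnnn.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_FTP_ROOT}/{stub}/{accession}/"
```

For a placeholder accession such as `GSE12345`, this gives a directory ending in `GSE12nnn/GSE12345/`. Under the same assumption, processed files supplied by the submitter are normally found in a `suppl/` sub-directory.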


Sequence Read Archive

  • SRA (www.ncbi.nlm.nih.gov/sra)

.pull-left[ igv]
.pull-right[
- NCBI’s HTS-specific repository.
- Sequencing-specific metadata.
- Stores raw data in SRA format.
- Reading SRA format requires the SRA Toolkit.
]
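
Since SRA-format files need the SRA Toolkit before they can be used as FASTQ, here is a minimal sketch of how the two usual toolkit steps (`prefetch`, then `fasterq-dump`) might be driven from Python. The function names and the `fastq` output directory are assumptions made for illustration; the commands only run if the toolkit is installed and on the `PATH`.

```python
import shutil
import subprocess


def sra_to_fastq_commands(run_accession: str, outdir: str = "fastq") -> list[list[str]]:
    """Build the SRA Toolkit command lines for one run accession."""
    return [
        # Download the run in SRA format.
        ["prefetch", run_accession],
        # Convert the SRA file to FASTQ, writing separate files per read.
        ["fasterq-dump", run_accession, "--split-files", "--outdir", outdir],
    ]


def run_commands(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        if shutil.which(command[0]) is None:
            raise RuntimeError(f"{command[0]} not found; is the SRA Toolkit installed?")
        subprocess.run(command, check=True)
```

A call such as `run_commands(sra_to_fastq_commands("SRR000001"))` would then fetch and convert one run; the accession here is only a placeholder.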


European Nucleotide Archive

.pull-left[ igv]
.pull-right[
- ENA acts as a European HTS repository.
- Mirrors much of SRA.
- Stores raw data.
- No SRA formats: FASTQ by default.
]
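
Because ENA serves FASTQ directly, the file locations for a run can be looked up without any format conversion. The sketch below assumes the ENA Portal API `filereport` endpoint and its `fastq_ftp` field; the helper names are hypothetical, and the `ftp://` prefix is an assumption about how the returned host paths should be completed.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession: str) -> str:
    """Build a file-report query asking for the FASTQ locations of an accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_fastq_urls(report_tsv: str) -> dict[str, list[str]]:
    """Map each run accession to its FASTQ URLs from a TSV file report."""
    urls = {}
    for row in csv.DictReader(io.StringIO(report_tsv), delimiter="\t"):
        # Paired-end runs list several files separated by semicolons.
        files = [path for path in row["fastq_ftp"].split(";") if path]
        urls[row["run_accession"]] = ["ftp://" + path for path in files]
    return urls
```

The text returned by fetching `ena_filereport_url(...)`, for example with `urllib.request.urlopen`, can be passed straight to `parse_fastq_urls`.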


Other Repositories

.pull-left[ igv

]
.pull-right[
- Many other repositories contain processed or unprocessed data.
- These are typically the result of a consortium’s data release policies.
- A good example is the ENCODE site (https://www.encodeproject.org/).
- UCSC has many useful links to genomics data in various formats (http://hgdownload.soe.ucsc.edu/downloads.html).
]


ENCODE Portal

The ENCODE portal provides access to raw data and to processed, standardised results.
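
The portal can also be queried programmatically. The sketch below assumes that the portal's `/search/` page returns JSON when `format=json` is added to the query, and that `type`, `assay_title` and `limit` are accepted query parameters. The function name is hypothetical, and the parameters should be checked against the ENCODE documentation before being relied on.

```python
from urllib.parse import urlencode

ENCODE_BASE = "https://www.encodeproject.org"


def encode_search_url(assay_title: str, limit: int = 10) -> str:
    """Build a search URL for experiments of one assay type, requesting JSON."""
    query = urlencode({
        "type": "Experiment",        # assumed object type for experiments
        "assay_title": assay_title,  # assumed filter name
        "limit": limit,
        "format": "json",
    })
    return f"{ENCODE_BASE}/search/?{query}"
```

The resulting URL can be opened in a browser, or fetched and decoded with the standard `json` module, to list the matching experiments.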


Repositories for processed data

.pull-left[ igv igv

]
.pull-right[
- Other specialist repositories exist.
- The recount2 database provides standardised counts for user analysis.
- Other databases such as ImmGen, BodyMap and Expression Atlas provide RNA-seq data for specific cells and tissues.
]


Reference data

  • Reference genomes are available from many locations.
  • Different assemblies exist for each genome:
    • Major revisions - change locations
    • Minor revisions - update annotation
  • Genome sequence is stored as FASTA.
  • Gene builds are stored as GFF3 or GTF.
  • iGenomes contains full annotation files for many genomes.
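
To make the two reference formats concrete, here is a minimal sketch of reading them with plain Python. It assumes uncompressed text input and simple GTF attributes with no semicolons inside quoted values. The function names are hypothetical, and real analyses would normally use a dedicated library instead.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its bases to the current record.
            lengths[name] += len(line)
    return lengths


def parse_gtf_line(line):
    """Split one GTF feature line into typed fields plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF fields")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if item:
            # Each attribute looks like: key "value"
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }
```

With an open file handle `fh` on a genome FASTA, `fasta_lengths(fh)` gives the length of each chromosome or contig. The `seqname` values in the GTF need to match those names, which is one reason to take sequence and annotation from the same assembly.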